feat(sinks): partition iceberg exports by day with conversation sort#91
Conversation
Lay out @hypaware/format-iceberg exports for an archive's job, not the cache's: partition by day(primaryTimestampColumn) — a writer-owned default, not the cache's conversation_id-identity cachePartitioning, which sets an unbounded ~1-file-per-conversation floor compaction can't beat — and sort each day partition by the dataset's lookup columns (conversation_id-led) so a conversation lookup prunes row groups by min/max instead of needing a partition per conversation. - Promote partitionSpecForDeclaration + validatePartitionSpecStability (and the declaration type) from src/core/cache/iceberg to a shared src/core/iceberg home, re-exported from src/core/index.js: they are core surface consumed by the registry, cache, plugin types, and now the export (LLP 0003). - format-iceberg derives the day grain + sort order per dataset at commit time, creates the table with both, and rejects partition-spec drift on append (iceberg_partition_spec_drift). Emits hyp_partition_spec and hyp_sort_order on commit spans. - Reframe maintenance compaction: available via icebergRewrite but not run in-daemon and not needed for a day grain (was "blocked by icebird"). Spec: LLP 0022 (rewritten from the abandoned cache-parity decision); xrefs in LLP 0014 and 0003. Tests: 10 (derivation + drift through the real icebird write path) plus a passing iceberg_export_partitioned_local_fs smoke asserting the layout and hyp_partition_spec. Clustering (icebird #22) and read pruning (#20/#21) require a published icebird containing commit 3edb15b; the package.json pin must move off 0.8.5 before those benefits land. The code degrades gracefully on 0.8.5 — partitioning and drift work; the sort order is recorded but inert. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Activates the conversation sort within day partitions and read-side scan pruning that the partitioned export records in metadata. Clears the merge blocker: 0.8.9 contains icebird 3edb15b. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Codex | Reverse partition-drift skipped when partitioning is null on a partitioned table (major; commit.js:99, partitioning.js:29-34) |
Yes — Risks bullet 2; Direct callers (commitBatch guard at commit.js:99); missing test beside iceberg-partitioning.test.js:181 |
| Codex | Lockfile not updated for icebird 0.8.9 → npm ci breakage (major; package.json:41, package-lock.json) |
False positive — package-lock.json is gitignored and untracked in this repo; the stale 0.8.5 lockfile is local environment state, not a PR defect. Residual real risk (validation ran against 0.8.5) is captured in Risks bullet 3, not as a lockfile defect |
| Claude | Conversation sort asserted only via metadata, never via row order or sort_order_id; suite passes on installed 0.8.5 which writes unsorted (major; iceberg-partitioning.test.js, partitioned smoke :533-538) |
Yes — Risks bullet 3; Cross-package usage (shared icebird engine); confirmed locally: node_modules/icebird resolves 0.8.5 at PR head |
| Claude | hyp sink maintain CLI text still says "compaction not supported by icebird", contradicting the PR's reframing (minor; core_commands.js:2336-2339, 2251-2253, 2326) |
Yes — Direct callers (compactionSupported consumer at core_commands.js:2326,2336-2339 confirmed unchanged); consistency gap only, not counted as a risk bullet |
| Claude | Stray </content> artifact at end of LLP 0022 (minor; llp/0022-iceberg-export-partitioning.spec.md:270) |
No — doc-only; touches no code surface mapped above (confirmed present in the diff at the file's final line) |
Codex review
Fix Validations
No bug-fix validations. The claimed icebird dependency update is reviewed below as a release-safety finding because the lockfile contract is incomplete.
Findings
2) Contract & Interface Fidelity
- Severity: major
- Confidence: high
- Evidence: hypaware-core/plugins-workspace/format-iceberg/src/partitioning.js:29, hypaware-core/plugins-workspace/format-iceberg/src/partitioning.js:34, hypaware-core/plugins-workspace/format-iceberg/src/commit.js:99
- Why it matters: If a dataset stops deriving partitioning for an already day-partitioned export table, the drift guard is skipped and the commit span reports
unpartitionedeven though the existing table still has a partition spec. - Suggested fix: On append, always inspect the existing partition spec; if
input.partitioningis null andcurrentPartitionSpec(metadata).fields.length > 0, throwiceberg_partition_spec_driftor explicitly preserve/report the existing layout. Add the reverse-drift test beside the existing unpartitioned-to-partitioned drift test at test/plugins/iceberg-partitioning.test.js:181.
8) Release Safety
- Severity: major
- Confidence: high
- Evidence: package.json:41, package-lock.json:15, package-lock.json:774
- Why it matters: The manifest requests
icebird0.8.9, but lockfile installs still resolve 0.8.5, sonpm ciwill either fail or run without the sort-on-write/read-pruning dependency this PR requires. - Suggested fix: Update and commit
package-lock.jsonforicebird@0.8.9, then rerunnpm testand the partitioned smoke from a clean install path.
No Finding
- Behavioral Correctness
- Change Impact / Blast Radius
- Concurrency, Ordering & State Safety
- Error Handling & Resilience
- Security Surface
- Resource Lifecycle & Cleanup
- Test Evidence Quality
- Architectural Consistency
- Debuggability & Operability
Evidence Bundle
- Changed hot paths:
exportDatasetlayout derivation and span attributes;commitRowStreampartitioning propagation;commitBatchcreate/drift behavior; promoted core partition-spec helper;icebirddependency pin. - Impacted callers: hypaware-core/plugins-workspace/format-iceberg/src/table-format.js:202, hypaware-core/plugins-workspace/format-iceberg/src/table-format.js:302, hypaware-core/plugins-workspace/format-iceberg/src/commit.js:191, src/core/cache/iceberg/store.js:105, src/core/cache/iceberg/store.js:126.
- Impacted tests: test/plugins/iceberg-partitioning.test.js:65, test/plugins/iceberg-partitioning.test.js:146, test/plugins/iceberg-partitioning.test.js:181, test/plugins/iceberg-partitioning.test.js:211.
- Unresolved uncertainty: I did not run tests; review was limited to the supplied diff plus targeted tracing.
Claude review
Claude review
Conversation sort is asserted only as metadata — no test verifies rows are actually sorted
- Severity: major
- Confidence: 90
- Evidence: test/plugins/iceberg-partitioning.test.js:1421-1423; hypaware-core/smoke/flows/iceberg_export_partitioned_local_fs.js:533-538
- Why it matters: Both the unit test and the smoke assert only that
sort-orders/default-sort-order-idmetadata was recorded — neither reads a data file back to assert conversation rows are actually ordered, nor checks the per-data-filesort_order_id; the gap is demonstrably real because the full suite passes against the currently installed icebird, which hardcodessort_order_id: 0and writes rows unsorted (nosrc/write/sort.js), so a missing or regressed sort-on-write in the dependency is invisible to tests and smokes. - Suggested fix: In the
commitBatchintegration test or the partitioned smoke, read one day-partition parquet file back with hyparquet and assert theconversation_idcolumn is non-decreasing, and/or assert the manifest entry'ssort_order_id === 1; re-run tests and the iceberg smokes against the published icebird 0.8.9 (a freshnpm install), per LLP 0022's own landing requirement.
hyp sink maintain CLI text contradicts the PR's own compaction reframing
- Severity: minor
- Confidence: 85
- Evidence: src/core/cli/core_commands.js:2336-2339 (also 2251-2253, 2326)
- Why it matters: The PR redefines
compactionSupported: falseto mean "not run by this sink (out-of-band only), not impossible" (maintenance.js docstring, smoke message, and LLP 0022#compaction all updated), but the user-facing CLI still prints "compaction not supported by icebird — data-file rewriting will be available in a future release" and labels the actioncompaction_unsupported, which is factually false under the icebird 0.8.9 pin this PR lands. - Suggested fix: Update the
runSinkMaintainsummary line, thecompaction_unsupportedaction label, and the docstring at core_commands.js:2251-2253 to the "not run by this sink — out-of-band only (LLP 0022)" framing, matching the already-updated smoke assertion.
Stray </content> artifact committed at the end of LLP 0022
- Severity: minor
- Confidence: 90
- Evidence: llp/0022-iceberg-export-partitioning.spec.md:270
- Why it matters: A literal
</content>paste artifact is committed as the final line of the new spec document, which the repo's LLP tooling (/ref-check,/ref-story) parses and CLAUDE.md's living-docs discipline exists to keep clean. - Suggested fix: Delete the trailing
</content>line from llp/0022-iceberg-export-partitioning.spec.md.
Cross-checked and dropped: three reviewers flagged "package-lock.json not updated for the icebird 0.8.9 bump" (rated up to blocker) — verified false positive: package-lock.json is gitignored in this repo (.gitignore:9) and untracked, so the stale 0.8.5 lockfile is local environment state, not a PR defect; a fresh install resolves the exact 0.8.9 pin from package.json. The residual real risk (local validation ran against a non-0.8.9 engine) is folded into the sort-coverage finding above.
Reports: .git/dual-review/pr-91
… assertions, CLI text - commitBatch now rejects reverse partition-spec drift: appending with no derived partitioning onto an already-partitioned table throws iceberg_partition_spec_drift instead of silently skipping the guard (and mislabeling spans as unpartitioned). LLP 0022#drift-rejection updated to record the guard as bidirectional; new test alongside the forward-drift case. - The conversation sort is now asserted on disk, not just in metadata: both the commitBatch integration test and the partitioned smoke read a day-partition parquet file back with hyparquet and assert conversation_id row order — fails on an icebird that records the sort order but writes unsorted. Verified against icebird 0.8.9. - hyp sink maintain CLI text reframed to match maintenance.js / LLP 0022: compaction is not run by this sink (out-of-band via icebergRewrite), not 'unsupported by icebird'; action label is now compaction_out_of_band. - Removed stray </content> artifact from the end of LLP 0022. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review fixes pushed — 4290b09Addressed all four actionable findings from the dual-agent review (the lockfile finding was already marked a false positive — Reverse partition-drift (Codex, major) — Sort asserted only as metadata (Claude, major) — both the
Stray 🤖 Generated with Claude Code |
🧭 Decision map — where to spend your attentionCompanion to the dual-review verdict. This casts no verdict — it points at the 6 forks where the author made a real choice, so you can skim the rest. Scanned: 33 hunks across 21 files. Most is mechanical: the verbatim promotion of 1. Existing tables hard-fail on upgrade — drift is rejected, not migrated · unhappy-path policy
throw newError(
'iceberg_partition_spec_drift',
`iceberg-format: partition spec drift at '${input.tableUrl}': ${message}`
)
2. icebird 0.8.5 → 0.8.9 — one line that swaps the shared write engine · new dependency"icebird": "0.8.9",
3. Partition axis = writer-owned
|
Summary
Lay out
@hypaware/format-icebergexports for an archive's job, not the cache's:day(primaryTimestampColumn)— a writer-owned default, derived independently of the cache'scachePartitioning. The cache partitions byconversation_id:identity, which sets an unbounded ~1-file-per-conversation floor that compaction can't beat. Day grain bounds file count by time (~dozens of multi-MB files/day) and prunes on time-range predicates.conversation_id-led, from its declared identity columns). A conversation lookup then prunes data files / row groups byconversation_idmin/max — preserving lookup speed without the file-count cost of partitioning on it.Rationale, the measurement that drove it, and the abandoned "inherit
cachePartitioning" decision it replaces are in LLP 0022.✅ Resolved — icebird pin (was merge blocker)
The clustering (#22) and read-pruning (#20/#21) benefits required a published icebird containing commit
3edb15b("Scan pruning and sort-on-write").3edb15b— released as0.8.9package.jsonpin0.8.5→0.8.9(shared-engine bump — the cache rides the same icebird and gains read-pruning for free)Re-verified on
0.8.9:npm test(809 pass / 0 fail) andnpm run smoke -- iceberg_export_partitioned_local_fsboth pass. The sort order recorded in metadata is now active on write.What's in here
Docs
llp/0022rewritten to the day-grain + sort decision; xrefs added tollp/0014(sinks) andllp/0003(the helper promotion is core surface).Core
partitionSpecForDeclaration+validatePartitionSpecStability+ the declaration type out ofsrc/core/cache/icebergto a sharedsrc/core/iceberg, re-exported fromsrc/core/index.js. Move + re-export; cache rewired; behavior identical.format-iceberg
partitioning.jsderives the day grain (fromprimaryTimestampColumn) + sort order (from the dataset's identity columns) per dataset at commit time.commit.jscreates the table withpartitionSpec+sortOrderand rejects partition-spec drift on append (iceberg_partition_spec_drift).table-format.jsemitshyp_partition_spec+hyp_sort_orderon the commit span.maintenance.jscompaction reframed: available viaicebergRewrite, but not run in-daemon and not needed for a day grain (was "blocked by icebird").Test plan
npm test— 809 pass / 0 fail (10 new:derivePartitioningderivation + drift, exercised through the real icebird write path viacommitBatchover a real local-fs BlobStore — day partition, conversation sort, 2-files-for-2-days, drift rejected, no false drift).npm run smoke -- iceberg_export_partitioned_local_fs— passes: drives the production sink, assertsday(message_created_at)spec +conversation_id-led sort, 4 rows → 2 day-partition files, andhyp_partition_spec/hyp_sort_orderon the span./ref-check— 74@refannotations, 0 broken.Note: pre-existing spool bug (not introduced here)
The existing
iceberg_export_local_fssmoke fails in this environment becausestorage.appendRows→flushTable({force:true})→readRowsyields 0 rows. Confirmed to reproduce on pristine master (changes stashed) and on icebird 0.8.5 — i.e. independent of this work and of the icebird version. The new partitioned smoke sidesteps it by populating the cache via the direct write path. Worth a separate investigation.🤖 Generated with Claude Code